Abstract. Believable nonverbal behaviors for embodied conversational agents
(ECAs) can create a more immersive experience for users and improve the
effectiveness of communication. This paper describes a nonverbal behavior
generator that analyzes the syntactic and semantic structure of the surface
text, as well as the affective state of the ECA, and annotates the surface text
with appropriate nonverbal behaviors. A number of video clips of people
conversing were analyzed to extract the nonverbal behavior generation rules.
The system works in real time and is user-extensible, so users can easily
modify or extend the current behavior generation rules.


1  Introduction 

Nonverbal behaviors serve to repeat, contradict, substitute, complement, accent,
or regulate spoken communication [1]. They can include facial expressions, head
movements, body gestures, body posture, and eye gaze. Nonverbal behaviors can
also be affected by a range of affective phenomena. For example, an angry
person might display lowered eyebrows, tensed lips, and more expressive body
gestures than one who is not angry. Such behavior can in turn influence the
beliefs, emotions, and behavior of observers.

Embodied conversational agents (ECAs) with appropriate nonverbal behaviors
can support interaction with users that ideally mirrors face-to-face human
interaction. Nonverbal behaviors can also help create a stronger relationship
between the ECA and the user and allow applications to have richer, more
expressive characters. Overall, appropriate nonverbal behaviors should provide
users with a more immersive experience while interacting with ECAs, whether
they are characters in video games, intelligent tutoring systems, or customer
service applications [5].

This paper describes our approach to creating a nonverbal behavior generator
module for ECAs that assigns behaviors to the ECA’s utterances. We are
especially interested in an approach that generates nonverbal behaviors given
only the surface text and, when available, the ECA’s emotional state,
turn-taking strategy, coping strategy, and overall communicative intent. In
general, we seek a robust process that does not make any strong assumptions
about markup of communicative intent in the surface text. Often such markup is
not available unless entered manually. Even in systems that use natural
language generation to create the surface text (e.g., the Stabilization and
Support Operations system 0), the natural language generation may not pass
down detailed information about how parts of the surface text (a phrase or
word, for example) convey specific aspects of the communicative intent or
emotional state. As a result, the nonverbal behavior generator often lacks
sufficiently detailed information and must rely to varying degrees on analyzing
the surface text. Therefore, a key interest here is whether we can extract
information from the lexical, syntactic, and semantic structure of the surface
text that can support the generation of believable nonverbal behaviors.

Fig. 1. SASO’s SmartBody

Our nonverbal behavior generator has been incorporated into SmartBody, an
ECA developed at the University of Southern California!)]. The SmartBody
project is part of the Stabilization and Support Operations (SASO) research
prototype, which grew out of the Mission Rehearsal Environment [3] to teach
leadership and negotiation skills under high-stress situations. In this system,
the trainees interact and negotiate with life-size ECAs that reside in a
virtual environment. Figure 1 shows SmartBody, in this case a doctor, with whom
the trainee interacts.

The next section describes related work. Section 3 describes research
on nonverbal behavior and our analysis of video clips to derive the nonverbal
behavior generation rules. Section 4 describes the system architecture of the
nonverbal behavior generator, walks through an example of the behavior
generation process, and discusses the extensibility of the generator. Section 5
concludes and proposes directions for future work.


^ This is joint work of the USC Information Sciences Institute and the USC
Institute for Creative Technologies.

2 Related Work

Mirroring the studies of nonverbal behavior in human communication, ECA
research has shown a significant improvement in users’ level of engagement
when they interact with ECAs that display believable nonverbal behaviors. The
work of Fabri et al. [5] suggests that ECAs with expressive abilities can
increase the sense of togetherness or community feeling. Durlach and Slater
[4] observed that ECAs with even primitive nonverbal behaviors generate strong
emotional responses from users.

Efforts to construct expressive ECAs range from animating human faces
with various facial expressions to generating complex body gestures that convey
emotions and communicative intent. Rea [S] engages in face-to-face interaction
with a user and models the intention and communicative intention of the
agent to generate appropriate facial expressions and body gestures. Becheiraz
and Thalmann developed a behavioral animation system for virtual characters
by modeling the emergence conditions for each character’s personality and
intentions. Striegnitz et al. [7] developed an ECA that autonomously generates
hand gestures while giving directions to the user.

There has also been work that emphasizes the reusability of nonverbal
behavior generators by separating behavior generation from behavior
realization. The BEAT [5] system is a plug-in model for nonverbal behavior
generation that extracts the linguistic structure of the text and suggests
appropriate nonverbal behaviors. It allows users to add new entries to extend
the gesture library or to modify the strategies for generating or filtering out
behaviors.

BEAT’s functions and purpose very much informed our work; however, there
are several differences. We are crafting our system around the new BML and
FML standards [3]. This should provide a clearer, more general, and
standardized interface for communicative intent and behavior specification.
BEAT had a variety of pre-knowledge about the surface text to be delivered at
different abstraction levels, which is not the case in our nonverbal behavior
generator. We are interested in exploring the degree to which a nonverbal
behavior generator can work with only the surface text and a minimal
specification of the communicative intent at a high level of abstraction, such
as the turn-taking information and the affective state. We are also exploring
a different range of expressive phenomena that is complementary to BEAT’s
work. Specifically, we are analyzing videos of emotional dialogues. Finally,
BEAT included a commercial language tagger, while we plan to keep our nonverbal
behavior generator open source.

3  Study  of  Nonverbal  Behaviors 

3.1  Nonverbal  Behaviors  and  Their  Functionalities 

There is a large research literature on the functionalities of nonverbal
behaviors during face-to-face communication [in] HB na US] [a. Heylen [12]
summarizes the functions of head movements during conversations. These include
signaling yes or no, enhancing communicative attention, anticipating an attempt
to capture the floor, signaling the intention to continue, marking the contrast
with the immediately preceding utterances, and marking uncertain statements and
lexical repairs. Kendon [13] describes the different contexts in which the head
shake may be used. The head shake is used with or without verbal utterances as
a component of negative expression, when a speaker makes a superlative or
intensified expression as in ‘very very old’, when a speaker self-corrects, or
to express doubt about what he or she is saying. In [14], lateral sweeps or
head shakes co-occur with concepts of inclusivity such as ‘everyone’ and
‘everything’ and with intensification through lexical choices such as ‘very’,
‘a lot’, ‘great’, and ‘really’. Side-to-side shakes also correlate with
expressions of uncertainty and lexical repairs. During narration, head nods
function as signs of affirmation and backchannel requests to the speakers.
Speakers also predictably change the head position for alternatives or items in
a list. Ekman [10] describes eyebrow movements for emotional expressions and
conversational signals. Some examples are an eyebrow raise or frown to accent a
particular word or to emphasize a particular conversation point. One of the
goals for our nonverbal behavior generator is to find features in the dialogue
that convey these attributes and annotate them with appropriate nonverbal
behaviors that are consistent with the research literature. Although the above
discussion is couched in general terms, nonverbal behaviors vary across
cultures and even individuals. We return to this issue later.

3.2  Video  Data  Analysis 

In  addition  to  the  existing  research  literature,  we  have  also  studied  the  uses  of 
nonverbal  behaviors  in  video  clips  of  people  conversing.  The  literature  is  useful 
for  broadly  classifying  the  behaviors.  However,  to  better  assess  whether  it  is 
feasible  to  build  behavior  generation  rules  that  could  map  from  text  to  behavior, 
an  analysis  of  actual  conversations  was  needed. 

We obtained video clips of users interacting with the Sensitive Artificial
Listener system from the Human-Machine Interaction Network on Emotion [El-
The Sensitive Artificial Listener (SAL) is a technique to engage users in
emotionally colored interactive discourse [16]. SAL is modeled on an ELIZA
scenario [17], a computer emulation of a psychotherapist. In SAL, the operator
plays the role of one of four characters with different personalities and
responds to the user with predefined scripts. The main goal is to pull the
user’s emotion towards the character’s emotional state.

Seventeen video clips were analyzed, each ranging from five to ten minutes in
length. The video clips capture only the users’ torso and above, and we mainly
annotated the facial expressions and head movements exhibited by the users.
For each video clip, we annotated the types of nonverbal behaviors portrayed,
their frequency, time frame, spoken utterance, and the users’ emotional states
when the behavior occurred. The annotations were recorded in XML for easy
parsing and processing.
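
As an illustration of what such an annotation might look like, the following
Python sketch builds and re-reads one record. The element and attribute names
(annotation, clip, behavior, and so on) and the sample values are hypothetical;
the schema actually used in the study is not given here.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation schema: names and values are illustrative only.
def make_annotation(clip_id, start, end, utterance, emotion, behaviors):
    """Build one <annotation> element for a stretch of video."""
    node = ET.Element("annotation", clip=clip_id,
                      start=f"{start:.2f}", end=f"{end:.2f}")
    ET.SubElement(node, "utterance").text = utterance
    ET.SubElement(node, "emotion").text = emotion
    for kind, count in behaviors:
        # One child per observed behavior type, with its frequency.
        ET.SubElement(node, "behavior", type=kind, count=str(count))
    return node

record = make_annotation("clip01", 12.5, 14.0, "I can't rest", "neutral",
                         [("head_shake", 2)])
xml_text = ET.tostring(record, encoding="unicode")

# Reading the record back is a single parse call plus element lookups.
parsed = ET.fromstring(xml_text)
behavior_types = [b.get("type") for b in parsed.findall("behavior")]
```

Keeping one element per annotated utterance makes it straightforward to count
behaviors per label later with ordinary XML queries.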

There were a number of different nonverbal behaviors observed in these video
clips. These behaviors include:

- Head Movement: nods, shakes, head moved to the side, head tilt, pulled
back, pulled down

- Eyebrow Movement: brow raised, brow lowered, brow flashes

- Eye/Gaze Movement: look up, look down, look away, eyes squinted, eyes
squeezed, eyes rolled

- Others: shoulder shrug, mouth pulled on one side

To annotate the utterances, we adopted the labels used in the literature and
created a few more for utterances in which we observed a nonverbal behavior
but for which the literature offered no appropriate label. The labels used are
affirmation, negation, contrast, intensification, inclusivity, obligation,
listing, assumption, possibility, response request, and word search. For each
utterance accompanied by nonverbal behaviors, we attached the labels applicable
to the utterance and annotated the behaviors. There were 161 utterances that
were annotated using these labels. Table 1 shows the number of utterances that
include each label.


Table 1. Breakdown of the number of utterances with corresponding labels

Label            # of utterances    Label             # of utterances
                 (out of 161)                         (out of 161)
Affirmation      39                 Response Request  9
Negation         62                 Inclusivity       7
Intensification  41                 Obligation        6
Word Search      25                 Assumption        3
Contrast         9

A number of utterances were annotated with two or more labels, which is why
the counts sum to more than 161. Besides these 161 utterances, there were
58 utterances that were accompanied by nonverbal behaviors but could not be
labeled appropriately because there was no clear and consistent pattern between
the utterance and the behaviors. The nonverbal behaviors on these utterances
were usually observed at the beginning of the sentence or when the user was
emphasizing a particular word or context, but the behaviors varied in each case.
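
The overlap can be checked directly from the counts in Table 1. The short
Python sketch below only adds up the numbers reported in the table; the
variable names are ours.

```python
# Label counts as reported in Table 1.
label_counts = {
    "Affirmation": 39, "Negation": 62, "Intensification": 41,
    "Word Search": 25, "Contrast": 9, "Response Request": 9,
    "Inclusivity": 7, "Obligation": 6, "Assumption": 3,
}

labeled_utterances = 161
total_labels = sum(label_counts.values())

# Label assignments beyond one per utterance; these come from
# utterances that carry two or more labels.
extra_labels = total_labels - labeled_utterances
```

The counts sum to 201, which is 40 more than the 161 labeled utterances.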

In general, we found a close match between the literature and our video
analysis on the mappings of nonverbal behaviors to certain utterances. For
example, a head shake usually occurred when a word with inclusive meaning such
as ‘all’ or ‘everything’ was spoken, and a lowered eyebrow with a head nod or
shake occurred when intensifying words like ‘really’ were spoken. We also
analyzed the parse trees of the utterances and found mappings between certain
behaviors and syntactic structures. Interjections, which were usually
associated with the words ‘yes’, ‘no’, and ‘well’ in the video clips, were
accompanied by a head nod, shake, or tilt in most cases.

Based on the study of the literature and our video analysis, we created
a list of nonverbal behavior generation rules, which are described in Figure 2.

(1) INTERJECTION: Head nod, shake, or tilt co-occurring with these words:

-  Yes,  no,  well 

(1) NEGATION: Head shakes and brow frown throughout the whole sentence or
phrase in which these words occur:

-  No,  not,  nothing,  can’t,  cannot 

(2) AFFIRMATION: Head nods and brow raise throughout the whole sentence or
phrase in which these words occur:

-  Yes,  yeah,  I  do,  I  am,  We  have,  We  do,  You  have,  true,  OK 

(3)  ASSUMPTION  /  POSSIBILITY:  Head  nods  throughout  the  sentence  or  phrase 
and  brow  frown  when  these  words  occur: 

- I guess, I suppose, I think, maybe, perhaps, could, probably

(3)  OBLIGATION:  Head  nod  once  co-occurring  with  these  words: 

-  Have  to,  need  to,  ought  to 

(4)  CONTRAST:  Head  moved  to  the  side  (lateral  movement)  and  brow  raise 
co-occurring  with  these  words: 

-  But,  however 

(4)  INCLUSIVITY:  Lateral  head  sweep  co-occurring  on  these  words: 

-  Everything,  all,  whole,  several,  plenty,  full 

(4)  INTENSIFICATION:  Head  nod  and  brow  frown  co-occurring  with  these  words: 

- Really, very, quite, completely, wonderful, great, absolutely, gorgeous, huge, fantastic,
so, amazing, just, important, ...

(4)  LISTING:  Head  moved  to  the  side  (lateral  movement)  and  to  the  other  before  and 
after  the  word  ‘and’: 

-  X  and  Y 

(4)  RESPONSE  REQUEST:  Head  moved  to  the  side  and  brow  raise  co-occurring  with 
these  words: 

-  You  know 

(4)  WORD  SEARCH:  Head  tilt,  brow  raise,  gaze  away  co-occurring  with  these  words: 

-  Um,  uh,  well 

Fig. 2. Nonverbal behavior generation rules. The numbers in parentheses indicate
the priority of each rule.

Each rule has associated nonverbal behaviors and a set of words that are
usually spoken when the nonverbal behavior is exhibited. We also defined a
priority value for each rule, based on our analysis, to resolve conflicts
between rules that could co-occur. For example, in the utterance ‘Maybe we
shouldn’t do that’, both the assumption rule and the negation rule could be
applied. However, the video analysis tells us that the negation rule
(priority 1) overrides the assumption rule (priority 3) in those cases. In
general, the nonverbal behavior rules that apply over a whole sentence or
phrase overrule those that apply to a single word.
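
The conflict resolution described above can be sketched in a few lines of
Python. This is a minimal illustration and not the system's implementation:
the Match record, the word-index spans, and the resolve function are our own
assumptions, and the spans are chosen by hand here, whereas the generator would
derive them from the parse tree. Priorities follow Figure 2, where a smaller
number wins.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Match:
    rule: str       # rule name as in Figure 2
    priority: int   # number in parentheses in Figure 2; 1 overrides 3
    start: int      # index of the first word the behavior covers
    end: int        # index one past the last word covered

def overlaps(a, b):
    """True when the two word spans share at least one word."""
    return a.start < b.end and b.start < a.end

def resolve(matches):
    """Keep the strongest rule wherever spans overlap (hypothetical helper).

    Ties between equal priorities go to whichever match was listed first,
    which is an arbitrary choice made for this sketch.
    """
    kept = []
    for m in sorted(matches, key=lambda m: m.priority):
        if not any(overlaps(m, k) for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m.start)

# 'Maybe we shouldn't do that' has five words; both rules are assumed
# to span the whole sentence.
candidates = [Match("assumption", 3, 0, 5), Match("negation", 1, 0, 5)]
winners = [m.rule for m in resolve(candidates)]
```

Rules whose spans do not overlap are all kept, so a word-level rule early in a
sentence can coexist with a phrase-level rule later in it.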

The following examples show how the rules are applied to given surface texts.

Example  1 

Surface  Text: 

I  do,  I  do.  I’m  looking  forward  to  that  but  I  can’t  rest  until  I  get  this  work 
done. 

Rules applied:

Affirmation rule from ‘I do’ and ‘I’m’
Negation rule from ‘can’t’

(Contrast rule applied from ‘but’ is overridden by the negation rule)

Nonverbal Behaviors:

Head nods on ‘I do, I do’ and ‘I’m looking forward’
Head shakes on ‘I can’t rest’

Example  2 

Surface  Text: 

Yes,  Prudence,  many  times.  I  actually  quite  like  you. 

Rules applied:

Interjection rule from ‘yes’
Intensification rule from ‘quite’

Nonverbal Behaviors:

Head nod on ‘yes’
Head nod on ‘quite’

In addition to the nonverbal behaviors associated with certain dialogue elements,
we also put small head nods on phrasal boundaries. This is based on our
experience that it makes the ECA more lifelike, perhaps because the human head
is often in constant (small) motion as a person talks.

The next section describes how we use these rules to create execution
commands for believable nonverbal behaviors.

4  System  Architecture 

4.1  Overview 

The nonverbal behavior generator is built to be modular and to operate in
real time with user-extensible behavior generation rules. The system receives
its input and sends its output through a message pipeline, and the main
data structure for the inputs and outputs is in XML form. More specifically, we
are using Function Markup Language (FML) and Behavior Markup Language
(BML) as part of the input and output messages (see Section 4.2 for more
details on FML and BML). The nonverbal behavior generator uses two major
tools to select and schedule behaviors: a natural language parser and an
Extensible Stylesheet Language Transformations (XSLT) processor. XSLT is a
language for transforming XML documents into other XML or XHTML documents. In
our case, we transform the input XML string by inserting time markers into the
surface utterance and adding behavior execution codes. The nonverbal behavior
generation rules are also represented in XSL format.

Fig. 3. System architecture of the nonverbal behavior generator

Figure 3 gives an overview of the system’s structure. The nonverbal behavior
generator’s input XML string contains the surface text of the agent as
well as other affective information such as the agent’s emotional state,
emphasis point, and coping strategy. The NVBGenerator module parses this XML
message, registers the agent’s affective information, and extracts the surface
text. The surface text is then sent to the natural language parser to obtain
the syntactic structure of the utterance. Given the parsed result of the
utterance and the behavior generation rules, the NVBGenerator selects the
appropriate behavior(s). The selected behaviors are then customized and
modified according to the affective information of the agent. Finally, the
execution code for the chosen behavior(s) is generated and sent out to the
virtual human controller. The following sections describe parts of the
processing steps in greater detail.

4.2  Function  Markup  Language  and  Behavior  Markup  Language 

The Social Performance Framework [18] and, more recently, SAIBA [19] are
being developed to modularize the design and research of embodied
conversational agents. These frameworks define modules that make a clear
distinction between the communicative intent and the behavior descriptions of
the ECA, with XML-based interfaces. This distinction is defined by two markup
languages, FML and BML, which consolidate a range of prior work in markup
languages (such as the Affective Presentation Markup Language [20] and the
Multimodal Utterance Representation Markup Language m). Function Markup
Language (FML) specifies the communicative and expressive intent of the agent
and will be part of the input message to our nonverbal behavior generator. The
following describes some of the elements defined in FML.

- AFFECT: The affective state of the speaker (e.g., JOY, DISTRESS,
RESENTMENT, FEAR, ANGER, ...).

- COPING: Identification of a coping strategy employed by the speaker.

- EMPHASIS: Speaker wants listeners to pay particular attention to this part
of the spoken text.

- TURN: Management of speaking turns (TAKE, GIVE, KEEP).

Behavior Markup Language (BML), on the other hand, describes the verbal
and nonverbal behaviors an agent will execute. The elements of BML roughly
correspond to the parts of the human body, and the attributes of each element
further define the details of the behavior’s execution, such as its start and
end time and its frequency. The set of elements defined in BML includes:

- HEAD: Movement of the head independent of eyes.

- FACE: Movement in the face.

- GAZE: Coordinated movement of the eyes, neck, and torso, indicating where
the character is looking.

- BODY: General movement of the body.

- GESTURE: Coordinated movement with arms and hands.

- SPEECH: Spoken delivery.

- LIPS: Movement of the mouth.

- ANIMATION: Plays back a character animation clip.

The selected behaviors from our nonverbal behavior generator are encoded
using these BML tags and included in the output message. Incorporating FML
and BML to specify the communicative intent and the nonverbal behaviors of
the agent not only gives a structured format for expressing this information,
but also allows the developer to process the information easily using any XML
processor; such processors are widely available.
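
To make the encoding step concrete, here is a small Python sketch that wraps a
list of selected behaviors in BML-like elements. It is only an approximation:
the element names follow the list above, but the attribute names (type, start,
end), the marker labels, and the bml_for helper are assumptions made for
illustration and are not taken from the BML specification.

```python
import xml.etree.ElementTree as ET

def bml_for(behaviors):
    """Serialize (element, type, start_marker, end_marker) tuples.

    Attribute names are hypothetical; consult the BML specification
    for the actual vocabulary.
    """
    bml = ET.Element("bml")
    for element, kind, start, end in behaviors:
        ET.SubElement(bml, element, type=kind, start=start, end=end)
    return ET.tostring(bml, encoding="unicode")

# A head nod between the first two time markers and a brow frown later on.
output = bml_for([
    ("head", "NOD", "T0", "T1"),
    ("face", "BROW_FROWN", "T2", "T3"),
])

# Any standard XML parser can read the result back.
root = ET.fromstring(output)
```

Because the output is plain XML, a downstream controller can consume it
without sharing any code with the generator.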

4.3  Nonverbal  Behavior  Generation  Process 

Let us take a closer look at how the nonverbal behaviors are selected and
generated. Assume the input message to the generator contains the following
information.

- Surface text:

Yes, I completely agree. I am not interested only in myself, you know.

- Emphasis: Emphasis on myself

- Affect: Neutral
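
One way to picture this input is as a small XML message. The Python sketch
below parses such a message and pulls out the three pieces of information
listed above. The message layout (message, fml, affect, emphasis, speech) is
hypothetical, since the exact format is not given in the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical input message; the real message layout may differ.
MESSAGE = """
<message agent="doctor">
  <fml>
    <affect type="NEUTRAL"/>
    <emphasis target="myself"/>
  </fml>
  <speech>Yes, I completely agree. I am not interested only in myself, you know.</speech>
</message>
"""

def read_message(xml_text):
    """Extract the affective information and the surface text."""
    root = ET.fromstring(xml_text.strip())
    fml = root.find("fml")
    return {
        "affect": fml.find("affect").get("type"),
        "emphasis": [e.get("target") for e in fml.findall("emphasis")],
        "text": root.find("speech").text.strip(),
    }

info = read_message(MESSAGE)
```

The surface text in info["text"] is what would be handed to the natural
language parser, while the affect and emphasis values are kept for later
adjustment of the selected behaviors.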

Fig. 4. Nonverbal behaviors animated on SmartBody

The NVBGenerator first parses the input message, extracts the surface string,
and sends it to the natural language parser. We are currently using Charniak’s
parser to process the utterance. The parse tree is sent back, and the
NVBGenerator inserts time markers between every word of the utterance. Then the
NVBGenerator analyzes the semantic and syntactic structure of the utterance
to decide which rules could be fired and inserts XML tags for such rules. The
XSLT processor looks at these rule tags and matches them to insert the BML
codes into the output message. If two rules overlap with each other, the one
with the higher priority is selected.
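
The time-marker step can be illustrated with a short Python sketch. The marker
syntax (a mark element with names T0, T1, ...) is an assumption made for this
example, and the sketch also places a marker before the first word and after
the last one so that behaviors can be anchored at either end of the utterance.

```python
def insert_time_markers(text):
    """Interleave hypothetical <mark> tags with whitespace-separated words."""
    words = text.split()
    parts = ["<mark name='T0'/>"]
    for index, word in enumerate(words, start=1):
        parts.append(word)
        # Marker T<index> sits immediately after the index-th word.
        parts.append(f"<mark name='T{index}'/>")
    return " ".join(parts)

marked = insert_time_markers("Yes, I completely agree.")
```

With markers in place, a behavior such as a head nod on ‘Yes’ can be scheduled
from T0 to T1 without knowing the actual speech timing in advance.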

In the example above, the following rules apply to the given surface text:
the interjection rule, which creates BML codes for a head nod on the word
‘Yes’; the intensification rule, which puts a head nod and lowered brow
movement on the word ‘completely’; the negation rule, which puts head shakes on
‘I am not interested’; the first noun phrase rule, which puts a small head nod
after ‘myself’; and the response request rule, which puts a head nod after
‘you know’. Since there is an emphasis on the word ‘myself’, the NVBGenerator
will replace the head nod there with a big nod and insert lowered brow movement
when ‘myself’ is spoken. The SmartBody system also has a number of pre-animated
gesture clips that could be used in place of the BML codes. For example, we
have an animation clip in which the ECA puts his hand up and shakes his head,
which could be used when the negation rule is selected instead of outputting a
BML code for a head shake. Figure 4 shows examples of some nonverbal behaviors
animated on SmartBody. Finally, the output message, consisting of the surface
text with time markers and BML codes, is sent to the SmartBody controller,
which synchronizes and animates the nonverbal behaviors.
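
The emphasis adjustment in this walkthrough can be sketched as a
post-processing pass over the selected behaviors. The behavior names and the
apply_emphasis function below are hypothetical and only mirror the description
above: a nod on the emphasized word becomes a big nod, and a lowered brow is
added.

```python
def apply_emphasis(behaviors, emphasized_word):
    """Strengthen nods on the emphasized word (illustrative only)."""
    result = []
    for word, behavior in behaviors:
        if word == emphasized_word and behavior.endswith("head_nod"):
            # Upgrade the nod and add the lowered brow on the same word.
            result.append((word, "big_head_nod"))
            result.append((word, "brow_lowered"))
        else:
            result.append((word, behavior))
    return result

planned = [
    ("Yes", "head_nod"),
    ("completely", "head_nod"),
    ("myself", "small_head_nod"),
]
adjusted = apply_emphasis(planned, "myself")
```

Behaviors on other words pass through unchanged, so the emphasis information
only affects the part of the utterance it refers to.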

4.4  Extensibility  and  Specialization 

The nonverbal behavior generator has been designed to be easy for users to
extend. As mentioned in Section 4.1, the nonverbal behavior generation rules
are represented in XSL format. There is one file that stores the behavior
descriptions for different nonverbal behaviors and another file that stores the
association between the rules and the nonverbal behaviors. More specifically,
the behavior description file stores the BML codes for different behaviors such
as big_head_nod, small_head_shake, and brow_frown, and the behavior generation
rule file stores the information on which behaviors should be generated for
each rule. For example, when the intensification rule is applied, a small head
nod and brow frowning should occur. As described in Section 4.3, the whole
behavior generation process is done in three steps: first, the NVBGenerator
analyzes the surface text and inserts an XML tag for the appropriate rule.
Then the behavior generation rule file matches this tag to see which behaviors
should occur, and finally the appropriate BML codes stored in the behavior
description file are inserted into the output message.

The separation between behavior descriptions and nonverbal behavior
generation rules allows each to be modified or extended without affecting the
other. For example, it is simple to add new entries of gesture animations or
behavior descriptions into the system. As the animator creates new gesture
animations or a programmer creates a new procedural behavior, one can simply
extend the behavior description file to add the name of the animation or the
behavior description for future use. It is also easy to modify the rules that
invoke the behavior descriptions. For example, if the current rule for
inclusivity contains a lateral head movement but one wishes to add a brow raise
to it, one simply needs to add lines to the file storing the behavior
generation rules, which will call the behavior description for the brow raise.
This separation also supports specialization of behavior according to
individual or cultural traits. For example, we can have different rules for
inclusivity based on culturally specific gesturing tendencies.
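
The benefit of this separation can be shown with a small Python sketch in which
two dictionaries stand in for the two XSL files. The dictionary names, the
behavior names, and the BML-like strings are assumptions made for illustration;
the actual files contain XSLT templates.

```python
# Stand-in for the behavior description file: behavior name -> BML-like code.
BEHAVIOR_DESCRIPTIONS = {
    "lateral_head_sweep": '<head type="SWEEP"/>',
    "brow_raise": '<face type="BROW_RAISE"/>',
}

# Stand-in for the behavior generation rule file: rule -> behavior names.
RULE_BEHAVIORS = {
    "inclusivity": ["lateral_head_sweep"],
}

def codes_for(rule):
    """Look up the codes a rule should emit, via the two tables."""
    return [BEHAVIOR_DESCRIPTIONS[name] for name in RULE_BEHAVIORS[rule]]

before = codes_for("inclusivity")

# Adding a brow raise to the inclusivity rule touches only the rule table;
# the behavior descriptions stay as they are.
RULE_BEHAVIORS["inclusivity"].append("brow_raise")
after = codes_for("inclusivity")
```

A culture-specific variant could be modeled in the same way by swapping in a
different rule table while reusing the same behavior descriptions.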

Using  XSL  to  represent  the  behavior  descriptions  and  behavior  generation 
rules  also  allows  the  user  to  make  modifications  without  knowing  the  details  of 
the  nonverbal  behavior  generator.  There  is  no  need  to  have  other  programming 
language  skills  or  study  how  the  behavior  generator  is  implemented.  By  learning 
simple  patterns  on  how  to  add  XSLT  templates,  one  can  create,  modify  or  delete 
behavior  descriptions  and  rules. 


5  Conclusion  and  Future  Work 

We have developed a framework for text-to-speech nonverbal behavior
generation. It analyzes the syntactic and semantic structure of the input text
and generates appropriate head movements, facial expressions, and body gestures.
We studied a number of video clips to develop rules that map specific words,
phrases, or speech acts to nonverbal behaviors, and constructed our behavior
generation rules accordingly. The behavior generator is designed to make it
easy for users to modify or create behavior descriptions and behavior
generation rules. The module was successfully incorporated into the SASO and
SmartBody system, using the SAIBA markup structure, and works in real time. It
has also been fielded in a cultural training application being developed at the
Institute for Creative Technologies.

Much work still remains to improve the system. Our next step would be to
evaluate the system and the behaviors generated. We are particularly interested
in users’ responses to the behaviors and what they infer from the behaviors.
We expect our current rules are too limited and overly general in their
applicability. Thus, we are also seeking ways to use various machine learning
techniques to aid us in the process of rule generation. One straightforward
approach would be to learn the mapping from bigrams or trigrams of words to
gestures. This would require a large gesture corpus; however, a suitable corpus
for our work is currently not available. In the absence of a large corpus, we
rather expect the learning should be informed by higher-level features such as
the syntactic, lexical, and semantic structure of the utterance or the ECA’s
emotional state, similar to what we used to craft the rules by hand.

Furthermore, we would like to modify the nonverbal behavior generation given
information on the ECA’s supposed gender, age, culture, or personality. The
system also lacks a good knowledge base of the environment in which the ECA
resides. A tight connection to the knowledge base of the objects and agents
in the virtual world will allow the ECA to have more sophisticated behaviors,
such as deictic gestures that correctly point at the object referred to.
Finally, we would like to model the affective state of the user interacting
with the ECA and generate appropriate behaviors that respond not only to the
agent’s emotions but also to the user’s emotions.